wink-tokenizer
Multilingual tokenizer that automatically tags each token with its type
Tokenize sentences in Latin and Devanagari scripts using wink-tokenizer. Some of its top features are outlined below:
- Support for English, French, German, Hindi, Sanskrit, Marathi and many more.
- Intelligent tokenization of sentences containing words from more than one language.
- Automatic detection & tagging of different types of tokens based on their features (see the configuration sketch after this list):
  - These include word, punctuation, email, mention, hashtag, emoticon, and emoji, etc.
  - User definable token types.
- High performance – tokenizes a typical English sentence at a speed of over 2.4 million tokens/second and a complex tweet containing hashtags, emoticons, emojis, mentions and an e-mail at a speed of over 1.5 million tokens/second (benchmarked on a 2.2 GHz Intel Core i7 machine with 16 GB RAM).
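The set of token types that are detected is configurable. The API documentation describes a defineConfig() method for this; the sketch below is a minimal illustration that assumes defineConfig() accepts an object of token-type flags (the exact flag names and semantics are given in the API documentation):

var tokenizer = require( 'wink-tokenizer' );
var myTokenizer = tokenizer();
// Assumed usage: pass an object of token-type flags to switch off hashtag
// & emoji detection (see the API documentation for the exact semantics).
myTokenizer.defineConfig( { hashtag: false, emoji: false } );
myTokenizer.tokenize( 'Party tonight 🎉 #fun' );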
Installation
Use npm to install:
npm install wink-tokenizer --save
Getting Started
// Load wink-tokenizer.
var tokenizer = require( 'wink-tokenizer' );
// Create its instance.
var myTokenizer = tokenizer();
// An English tweet containing a mention, an email, an emoji, an emoticon and a hashtag.
var s = '@superman: hit me up on my email r2d2@gmail.com, 2 of us plan party🎉 tom at 3pm:) #fun';
myTokenizer.tokenize( s );
// A French sentence ("Better safe than sorry") ending with an emoticon.
s = 'Mieux vaut prévenir que guérir:-)';
myTokenizer.tokenize( s );
// A sentence mixing Devanagari and Latin scripts: "Dravid scored 36 centuries in Tests; 21 of them were on foreign playgrounds."
s = 'द्रविड़ ने टेस्ट में ३६ शतक जमाए, उनमें 21 विदेशी playground पर हैं।';
myTokenizer.tokenize( s );
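Each tokenize() call above returns an array of tokens, with every token carrying its text along with the detected type. A minimal sketch of consuming that output, assuming the value and tag property names used in the API documentation:

var tokens = myTokenizer.tokenize( '@superman: hit me up on my email r2d2@gmail.com #fun' );
// Keep only the plain word tokens; mention, email, hashtag and punctuation
// tokens are filtered out.
var words = tokens.filter( function ( t ) {
  return ( t.tag === 'word' );
} );
// Count how many tokens of each type were found.
var counts = Object.create( null );
tokens.forEach( function ( t ) {
  counts[ t.tag ] = ( counts[ t.tag ] || 0 ) + 1;
} );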
Documentation
Check out the tokenizer API documentation to learn more.
Need Help?
If you spot a bug that has not already been reported, raise a new issue or consider fixing it and sending a pull request.
About wink
Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS. The code is thoroughly documented for easy human comprehension and has ~100% test coverage for the reliability needed to build production-grade solutions.
Copyright & License
wink-tokenizer is copyright 2017-21 GRAYPE Systems Private Limited.
It is licensed under the terms of the MIT License.